Introduction to Linear Regression

Data for Today

The ncbirths dataset is a random sample of 1,000 cases taken from a larger dataset collected in North Carolina in 2004.

Each case describes the birth of a single child born in North Carolina, along with various characteristics of the child (e.g. birth weight, length of gestation, etc.), the child’s mother (e.g. age, weight gained during pregnancy, smoking habits, etc.) and the child’s father (e.g. age).

Draw a picture of how you would expect this dataset to look.

Relationships Between Variables

  • In a statistical model, we generally have one variable that is the output and one or more variables that are the inputs.
  • Response variable
    • a.k.a. \(y\), dependent
    • The quantity you want to understand
    • In this class – always numerical
  • Explanatory variable
    • a.k.a. \(x\), independent, predictor
    • Something you think might be related to the response
    • Either numerical or categorical

Visualizing Linear Regression


  • The scatterplot has been called the most “generally useful invention in the history of statistical graphics.”
  • It is a simple two-dimensional plot in which the two coordinates of each dot represent the values of two variables measured on a single observation.

Characterizing Relationships


  • Form (e.g. linear, quadratic, non-linear)

  • Direction (e.g. positive, negative)

  • Strength (how much scatter/noise?)

  • Unusual observations (do points not fit the overall pattern?)

Your Turn!

How would you characterize this relationship?

  • shape
  • direction
  • strength
  • outliers

What if you added another variable?

Summarizing a Linear Relationship


Correlation:

strength and direction of a linear relationship between two quantitative variables

  • Correlation coefficient between -1 and 1
  • Sign of the correlation shows direction
  • Magnitude of the correlation shows strength
births_post26 <- ncbirths %>% 
  drop_na(weight, weeks) %>% 
  filter(weeks > 26)


get_correlation(births_post26, 
                weeks ~ weight)
# A tibble: 1 × 1
    cor
  <dbl>
1 0.578
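The sign and magnitude properties above can be seen with a minimal base-R sketch (the vectors here are made up for illustration; `cor()` is base R's correlation function):

```r
x <- c(1, 2, 3, 4, 5)
y_up   <- c(2.1, 3.9, 6.2, 7.8, 10.1)   # roughly y = 2x: strong positive linear pattern
y_down <- c(9.8, 8.1, 6.0, 4.2, 1.9)    # roughly y = -2x + 12: strong negative pattern

cor(x, y_up)    # close to +1
cor(x, y_down)  # close to -1
```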

Anscombe Correlations

Four datasets, very different graphical presentations

  • same mean and standard deviation in both \(x\) and \(y\)
  • same correlation
  • same regression line

For which of these relationships is correlation a reasonable summary measure?

The Importance of Language

  • The word “correlation” has both a precise mathematical definition and a more general definition for typical usage in English.

  • These uses are obviously related and generally in sync.

  • There are times when these two uses can be conflated and/or misconstrued.

Linear Regression

  • Models are ubiquitous in Statistics!

  • We often assume that the value of our response variable is some function of our explanatory variable, plus some random noise.


  • In this case, we assume the relationship between \(x\) and \(y\) takes the form of a linear function.
    • \(response = intercept + slope \cdot explanatory + noise\)
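The "linear function plus noise" assumption can be sketched by simulating data where we know the truth and checking that `lm()` recovers it (the intercept 2 and slope 0.5 here are made up for illustration):

```r
set.seed(42)
x <- runif(500, 20, 45)                 # explanatory variable
noise <- rnorm(500, mean = 0, sd = 1)   # random noise
y <- 2 + 0.5 * x + noise                # response = intercept + slope * explanatory + noise

fit <- lm(y ~ x)
coef(fit)  # estimates should land close to the true intercept (2) and slope (0.5)
```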

Estimated / Fitted Regression Model

\[ \widehat{y} = b_0 + b_1 \cdot x \]

Why does this equation have a hat on y?

Coefficient Estimates

weeks_lm <- lm(weight ~ weeks, data = births_post26)
  
get_regression_table(weeks_lm)
# A tibble: 2 × 7
  term      estimate std_error statistic p_value lower_ci upper_ci
  <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept   -5.34      0.565     -9.45       0   -6.45    -4.23 
2 weeks        0.325     0.015     22.2        0    0.296    0.354

Our focus (for now…)

Estimated regression equation

\[\widehat{y} = b_0 + b_1 \cdot x\]

# A tibble: 2 × 7
  term      estimate std_error statistic p_value lower_ci upper_ci
  <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept   -5.34      0.565     -9.45       0   -6.45    -4.23 
2 weeks        0.325     0.015     22.2        0    0.296    0.354


Write out the estimated regression equation!

How do you interpret the intercept value of -5.341?


How do you interpret the slope value of 0.325?

Obtaining Residuals

\(\widehat{weight} = -5.341+0.325 \cdot weeks\)


What would the residual be for a pregnancy that lasted 39 weeks and whose baby weighed 7.63 pounds?
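As a warm-up (without giving away the answer above), here is the residual calculation for a different, made-up case: a 40-week pregnancy with an 8-pound baby, using the fitted equation:

```r
# residual = observed - predicted
predicted <- function(weeks) -5.341 + 0.325 * weeks

obs_weight  <- 8
pred_weight <- predicted(40)            # -5.341 + 0.325 * 40 = 7.659
residual    <- obs_weight - pred_weight
residual                                # 0.341: baby weighed 0.341 lb more than predicted
```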

A different explanatory variable

weight_gain_lm <- lm(weight ~ gained, data = births_post26)
  
get_regression_table(weight_gain_lm)
# A tibble: 2 × 7
  term      estimate std_error statistic p_value lower_ci upper_ci
  <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept    6.74      0.102     65.8        0    6.54     6.94 
2 gained       0.015     0.003      4.79       0    0.009    0.021


Write out this estimated regression equation!

How would you choose which model is better?!

births_post26 %>% 
  drop_na(weight, gained, weeks) %>% 
  summarize(cor_weeks = cor(weight, weeks), 
            R_sq_weeks = cor_weeks^2, 
            cor_gained = cor(weight, gained),
            R_sq_gained = cor_gained^2)
# A tibble: 1 × 4
  cor_weeks R_sq_weeks cor_gained R_sq_gained
      <dbl>      <dbl>      <dbl>       <dbl>
1     0.572      0.327      0.153      0.0234
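(The `cor_weeks` value here, 0.572, differs slightly from the earlier 0.578, likely because rows with a missing `gained` are now also dropped.) For simple linear regression with one numeric predictor, \(R^2\) is exactly the squared correlation; a quick base-R check with made-up vectors:

```r
x <- c(1, 3, 4, 6, 8)
y <- c(2, 4, 7, 8, 12)

r_sq_from_cor <- cor(x, y)^2                   # squared correlation
r_sq_from_lm  <- summary(lm(y ~ x))$r.squared  # R-squared from the fitted model

all.equal(r_sq_from_cor, r_sq_from_lm)         # TRUE
```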

Categorical Explanatory Variables

Finding distinct levels:

distinct(births_post26, habit)
# A tibble: 2 × 1
  habit    
  <fct>    
1 nonsmoker
2 smoker   

Indicator Variables

\[ \widehat{y} = b_0 + b_1 \cdot x \]

\(x\) is a categorical variable with levels:

  • "nonsmoker"
  • "smoker"

Need:

  • “baseline” mean
  • “offsets”

\(1_{smoker}(x) = 1\) if the mother was a "smoker"

\(1_{smoker}(x) = 0\) if the mother was a "nonsmoker"
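The indicator \(1_{smoker}(x)\) can be built by hand in base R (the `habit` values here are made up for illustration; `lm()` constructs this encoding automatically for a factor):

```r
habit <- c("nonsmoker", "smoker", "nonsmoker", "smoker", "nonsmoker")

# 1 if the mother was a "smoker", 0 if "nonsmoker" (the baseline level)
ind_smoker <- as.numeric(habit == "smoker")
ind_smoker  # 0 1 0 1 0
```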

A different equation

habit_lm <- lm(weight ~ habit, data = births_post26)
  
get_regression_table(habit_lm)
# A tibble: 2 × 7
  term          estimate std_error statistic p_value lower_ci upper_ci
  <chr>            <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept         7.23     0.047    155.     0        7.14     7.32 
2 habit: smoker    -0.4      0.13      -3.07   0.002   -0.656   -0.145


What is the estimated mean birth weight for nonsmoking mothers?
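With a single two-level categorical predictor, the intercept is the baseline group's mean and the offset is the difference in group means. A minimal check with made-up weights:

```r
weight <- c(7.5, 7.0, 7.4, 6.8, 6.5)
habit  <- factor(c("nonsmoker", "nonsmoker", "nonsmoker", "smoker", "smoker"))

fit <- lm(weight ~ habit)
coef(fit)[1]                        # intercept = mean weight for nonsmokers (7.3)
mean(weight[habit == "nonsmoker"])  # same value, computed directly
coef(fit)[1] + coef(fit)[2]         # baseline + offset = mean for smokers (6.65)
```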